The kth-nearest neighbor rule is arguably the simplest and most
intuitively appealing nonparametric classification procedure. However,
application of this method is inhibited by lack of knowledge
about its properties, in particular, about the manner in which it is 
influenced by the value of k; and by the absence of techniques for em- 
pirical choice of k. In the present paper we detail the way in which 
the value of k determines the misclassification error. We consider two 
models, Poisson and Binomial, for the training samples. Under the 
first model, data are recorded in a Poisson stream and are "assigned" 
to one or other of the two populations in accordance with the prior 
probabilities. In particular, the total number of data in both training 
samples is a Poisson-distributed random variable. Under the Binomial 
model, however, the total number of data in the training samples is 
fixed, although again each data value is assigned in a random way.
Although the values of risk and regret associated with the Poisson and
Binomial models are different, they are asymptotically equivalent to
first order, and also to the risks associated with kernel-based classi- 
fiers that are tailored to the case of two derivatives. These properties 
motivate new methods for choosing the value of k. 

1. Introduction. In the classification or discrimination problem with two
populations, denoted by X and Y, one wishes to classify an observation z to
either X or Y using only training data. The kth-nearest neighbor classification
rule is arguably the simplest and most intuitively appealing nonparametric
classifier. It assigns z to population X if at least $\frac{1}{2}k$ of the k values in
the pooled training-data set nearest to z are from X, and to population Y
otherwise. The first study of this method was undertaken by Fix and Hodges
(1951). Since then there have been many investigations into the method's
statistical properties. Little is known about the structure of its error
probabilities, however, and neither are formulae available for optimal choice of k.
Practical methods for optimal empirical choice of k have apparently not
been given.
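
The classification rule just described can be sketched in a few lines. The following is a minimal illustration only, assuming Euclidean distance and NumPy; the function name `knn_classify` and its arguments are hypothetical and do not come from the paper.

```python
import numpy as np

def knn_classify(z, x_train, y_train, k):
    """Classify z as 'X' or 'Y' by the k-nearest neighbor rule.

    z is assigned to X when at least k/2 of the k pooled training points
    nearest to z (Euclidean distance) come from the X sample.
    """
    z = np.atleast_1d(np.asarray(z, dtype=float))
    x_train = np.asarray(x_train, dtype=float).reshape(len(x_train), -1)
    y_train = np.asarray(y_train, dtype=float).reshape(len(y_train), -1)
    pooled = np.vstack([x_train, y_train])
    # Label 1 marks a point from the X sample, 0 a point from the Y sample.
    from_x = np.concatenate([np.ones(len(x_train)), np.zeros(len(y_train))])
    dist = np.linalg.norm(pooled - z, axis=1)
    nearest = np.argsort(dist, kind="stable")[:k]
    return "X" if from_x[nearest].sum() >= k / 2 else "Y"

x = [-1.0, -0.8, -0.5]
y = [0.6, 0.9, 1.2]
print(knn_classify(-0.7, x, y, 3), knn_classify(1.0, x, y, 3))
```

With an even k, a tie (exactly k/2 neighbors from each sample) is resolved in favor of X, in line with the "at least" wording of the rule.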

The present paper resolves these issues, and focuses on expansions of
the error rate of kth-nearest neighbor classifiers which are associated with
optimal choice of k. We show that the values of risk of nearest-neighbor
classifiers can be represented quite simply in terms of properties of the two
populations, and that this leads to new, practical ways of choosing the value
of k.

The sizes of the training samples used to construct classifiers might fairly 
be viewed as random variables. Consider, for example, the case where a 
classifier is used by a bank to determine, from the bank's data, whether a new 
customer is likely to default on a loan. The sizes of the two training samples 
could be the number, M, of previous customers who defaulted, and the 
number, N, who did not default, respectively. An appropriate model for the
distributions of M and N might be that they are statistically independent
and Poisson, with means $\mu$ and $\nu$, say. For example, the Poisson sample-size
model could arise if the population of potential customers were much larger 
than the number of customers who sought loans from the bank. 

Thus, Poisson rather than deterministic models for training-sample sizes 
can be motivated. Here, the total number of data in the two training samples 
is random, and data in a Poisson stream are "assigned" to one or other 
of the two populations using a formula which is based on the respective 
prior probabilities. A different approach, which gives rise to a Binomial- 
type model, involves the total number of training data being pre-determined, 
but apportions these data among the two populations in a manner similar 
to the Poisson model. We shall show that these two approaches produce 
nearest-neighbor classifiers with risks that are different but are nevertheless 
first-order equivalent. 

For fixed k the risk of a k-nearest neighbor classifier converges to its
limit relatively quickly, at rate $T^{-2}$, as total sample size, T, increases [Cover
(1968)]. However, the limiting value is strictly larger than the Bayes risk of
the "ideal" classifier that would be used if both population densities were
known. By way of comparison, in the case of imperfect information about
the population, and in particular, in parametric settings, the risk of
empirical Bayes classifiers converges to the Bayes risk no more rapidly than $T^{-1}$;
see Kharin and Duchinskas (1979). In nonparametric settings the rate of 
convergence to Bayes risk is slower still, but may nevertheless be asymptoti- 
cally optimal; see, for example, Marron (1983) and Mammen and Tsybakov 
(1999). 

In most previous work on nearest-neighbor classifiers, the value of k was
held fixed. Cover and Hart (1967) gave upper bounds for the limit of the
risk of nearest-neighbor classifiers. Wagner (1971) and Fritz (1975) treated
convergence of the conditional error rate when k = 1. Devroye and Wagner
(1977, 1982) developed and discussed theoretical properties, particularly
issues of mathematical consistency, for k-nearest-neighbor rules. Devroye
(1981) found an asymptotic bound for the regret with respect to the
Bayes classifier. Devroye et al. (1994) gave a particularly general descrip- 
tion of strong consistency for nearest-neighbor methods. Psaltis, Snapp and 
Venkatesh (1994) generalized the results of Cover (1968) to general dimen- 
sion, and Snapp and Venkatesh (1998) further extended the results to the 
case of multiple classes. Bax (2000) gave probabilistic bounds for the con- 
ditional error rate in the case where k = 1. Kulkarni and Posner (1995) 
addressed nearest-neighbor methods for quite general dependent data, and 
Holst and Irle (2001) provided formulae for the limit of the error rate in the
case of dependent data. Related research includes that of Györfi (1978, 1981)
and Györfi and Györfi (1978), who investigated the rate of convergence to
the Bayes risk when k tends to infinity as T increases.

In the case of classifiers based on second-order kernel density estimators,
and for populations with twice-differentiable densities, the risk typically
converges to the Bayes risk at rate $n^{-4/(d+4)}$, where d denotes the number of
dimensions. See, for example, Kharin (1982), Raudys and Young (2004) and
Hall and Kang (2005). In a minimax sense that Marron (1983) makes precise, 
this rate is optimal. As we show in this paper, nearest-neighbor classifiers 
with Poisson or Binomial interpretations of sample size have the same prop- 
erty. 

Recent work on properties of classifiers focuses largely on deriving up- 
per and lower bounds to regret in cases where the classification problem is 
relatively difficult, for example, where the classification boundary is compar- 
atively unsmooth. Research of Audibert and Tsybakov (2005) and Kohler 
and Krzyzak (2006), for example, is in this category. The work of Mammen 
and Tsybakov (1999), which permits the smoothness of a classification prob- 
lem to be varied in the continuum, forms something of a bridge between the 
smooth case, which we treat, and the rough case. 

There is a literature on empirical choice of k; see, for example, Chapter 26
of Devroye, Györfi and Lugosi (1996) and Sections 7.2 and 8.4 of Györfi et
al. (2002). More generally, Devroye, Györfi and Lugosi (1996) explored the
properties and features of nearest-neighbor methods in the setting of pattern 
recognition. Chapter 5 of that monograph gives a good guide to the literature 
in this setting. 

2. Main results.

2.1. Different interpretations of sample size. Assume we have m identically
distributed data $\mathcal{X} = \{X_1, \ldots, X_m\}$, and n identically distributed data
$\mathcal{Y} = \{Y_1, \ldots, Y_n\}$, all of them d-variate and mutually independent. Let the
respective probability densities be f and g. Given a compact set $\mathcal{R} \subset \mathbb{R}^d$,
we wish to use the data to classify a new datum $z \in \mathcal{R}$ as coming from the
X or Y population. Note that we do not assume f and g themselves to be
compactly supported; the constraint is only that we confine attention to the
problem of classifying new data that come from a given compact region $\mathcal{R}$.

In many instances the ratio of the sizes of the datasets is a good approx- 
imation to the ratio of the prior probabilities of observing the respective 
populations. We shall adopt this viewpoint, which raises the issue of how 
we should interpret m and n. Two models arise in a natural way: the Poisson,
where the individual sample sizes are Poisson-distributed and data are
assigned randomly to one population or another, in proportion to the respective
likelihoods; and the Binomial, where the sum of the two training-sample
sizes is deterministic but data are ascribed to populations in the same fashion
as before. The Poisson case can be viewed as the result of taking a sample
from a marked point process in $\mathbb{R}^d$, and assigning marks in a way that
reflects prior probabilities; and the Binomial case is the result of conditioning
on total sample size in the Poisson setting.

In the sense that it avoids the conditioning step, the Poisson case is the 
more natural and has the greater degree of symmetry. Therefore, we take 
that as the basis for analysis, and tackle the Binomial model by reference 
to the solution in the Poisson case. 

In multi-population cases, the kth nearest-neighbor classifier would typi- 
cally be used to assign z to population j if that population accounted for the 
greatest number of data among the k values in the pooled dataset that are 
nearest to z. Our results apply directly to this case, provided we work within 
a compact region at each point of which the maximum value of the popula- 
tion densities is achieved by no more than two densities. Another straight- 
forward extension is to the case where distance is measured in a weighted 
Euclidean metric; we shall work only with the standard, unweighted form. 

2.2. Poisson model. Assume that $\mathcal{X} = \{X_1, X_2, \ldots\}$ and $\mathcal{Y} = \{Y_1, Y_2, \ldots\}$
represent points of type X and type Y, respectively, in a two-type marked
Poisson process, $\mathcal{P}$, in $\mathbb{R}^d$, with intensity function $\mu f + \nu g$, and respective
probabilities

(2.1) $\psi(z) = \dfrac{\mu f(z)}{\mu f(z) + \nu g(z)}$

and $1 - \psi(z)$ that a point of $\mathcal{P}$ at z is of type X or of type Y. In particular,
the respective prior probabilities of the X and Y populations are $\mu/(\mu + \nu)$
and $\nu/(\mu + \nu)$. It will be assumed that f and g are held fixed, and that $\mu$
and $\nu$ satisfy

(2.2) $\mu = \mu(\nu)$ increases with $\nu$ in such a manner that $\mu/(\mu + \nu) \to p \in (0, 1)$ as $\nu \to \infty$.

Define $\rho = pf/\{pf + (1 - p)g\}$, a function on $\mathbb{R}^d$.

Suppose too that the respective densities, f and g, of the X and Y
populations, satisfy

(2.3) the set $\mathcal{S} \subseteq \mathcal{R}$, defined as the locus of points z for which $\rho(z) = \frac{1}{2}$,
is of codimension 1 and of finite measure in $d - 1$ dimensions;

(2.4) the distributions with densities f and g have finite second moments;
f and g are both continuous in an open set containing $\mathcal{R}$, and both have
two continuous derivatives within an open set containing $\mathcal{S}$; and
$f + g > 0$ on $\mathcal{R}$.



The first part of (2.3) asks that $\mathcal{S}$ be a $(d - 1)$-dimensional structure: a set
of isolated points if d = 1, a set of curves in the plane if d = 2, and so on.

The assumption of two derivatives in (2.4) is to be expected since, as 
noted in Section 1, the convergence rate of regret that is achieved by nearest- 
neighbor methods is optimal under that smoothness assumption. The con- 
dition that the derivatives assumed in (2.4) are continuous is imposed only 
so that a concise asymptotic formula for regret can be given; see (2.8) be- 
low. Without the precision provided by the continuity assumption, we could 
state only an upper bound for regret, in which the right-hand side of (2.8) 
was replaced by $O\{k^{-1} + (k/\nu)^{4/d}\}$.

We ask too that the slopes at which the two densities, weighted in
proportion to their prior probabilities, meet along $\mathcal{S}$ be bounded away from zero
along $\mathcal{S}$. That is,

(2.5) the function $a(z)^2 \equiv \sum_{j=1}^{d} \left\{ p\,\dfrac{\partial f(z)}{\partial z^{(j)}} - (1 - p)\,\dfrac{\partial g(z)}{\partial z^{(j)}} \right\}^2$ is bounded away from zero on $\mathcal{S}$.

Equivalently, the prior-weighted densities cross at an angle, rather than meet
in a tangential way. If the prior-weighted densities were to have exactly equal
gradients at crossing places, then there would be an explicit and intimate 
connection between the distributions of X and Y populations that could 
hardly arise by chance. It is difficult to envisage that perfect alignment of 
densities at crossing points would actually occur commonly in practice. 

Write $dz_0$ for an infinitesimal element of $\mathcal{S}$, centred at $z_0$. Let
$a_d = \pi^{d/2}/\Gamma(1 + \frac{1}{2}d)$ denote the content of the unit d-dimensional sphere, define
$\lambda = p(1 - p)^{-1} f + g$ and

(2.6) $\alpha(z) = \dfrac{1}{d + 2}\, \lambda(z)^{-1-(2/d)}\, a_d^{-2/d} \sum_{j=1}^{d} \left\{ \rho_j(z)\lambda_j(z) + \frac{1}{2}\rho_{jj}(z)\lambda(z) \right\}$,

where $z = (z^{(1)}, \ldots, z^{(d)})$, $\lambda_j(z)$ and $\rho_j(z)$ denote the first derivatives of the
respective functions with respect to $z^{(j)}$, and $\rho_{jj}(z)$ is the second derivative
of $\rho(z)$ with respect to $z^{(j)}$. Put $\dot\rho = (\rho_1, \ldots, \rho_d)$.
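
As a quick sanity check on the constant $a_d = \pi^{d/2}/\Gamma(1 + \frac{1}{2}d)$, the sketch below evaluates it with the standard library; the helper name `unit_ball_volume` is ours and purely illustrative. It should return the familiar values 2, $\pi$ and $4\pi/3$ for d = 1, 2, 3.

```python
import math

def unit_ball_volume(d):
    """Content a_d = pi^(d/2) / Gamma(1 + d/2) of the unit ball in d dimensions."""
    return math.pi ** (d / 2) / math.gamma(1 + d / 2)

for d in (1, 2, 3):
    print(d, unit_ball_volume(d))
```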

Let $ denote the cumulative distribution function of the standard nor- 
mal distribution and let ^i(^) = — ~ p)9{z)}'^ p{z). It can 
be shown that, on S, a{z) = ^i(z) = 4/i(z)||p(z)||, where h{z) denotes the 
common value that pf{z) and (1 —p)g{z) assume at z S 5. Therefore, since 
assumptions (2.3)-(2.5) imply that a and h are bounded away from zero 
and infinity on S, they also ensure that ^i{z) and ||/o(z)|| are bounded away 
from zero and infinity there. It follows that the constants C\ and C2, given 
by 

Ci= / ^,{z,m\p{zo)\?)-'dzo = \f Pf\^dz,, 
Js 2Js\\p{zq)\\ 



^i{z,m\p{zo)\?r^a{z,fdzo = 2 I J!M^a{zofdzo, 



(2.7) 

^2 = 4 

are finite, that $C_1$ is nonzero, and that $C_2 = 0$ if and only if $\alpha$ is identically
zero on $\mathcal{S}$.

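The Bayes classifier assigns z to the X or Y population according as
$\psi(z) > \frac{1}{2}$ or $\psi(z) < \frac{1}{2}$, respectively. Therefore, the Bayes risk for classification
on $\mathcal{R}$ is

$\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}} = \displaystyle\int_{\mathcal{R}} \min\left\{ \frac{\mu f(z)}{\mu + \nu},\ \frac{\nu g(z)}{\mu + \nu} \right\} dz$,

where, here and below, the superscript "Pois" will indicate that the setting
of the Poisson model is being considered. The risk of the k-nearest neighbor
classifier, which assigns z to population X if at least $\frac{1}{2}k$ of the k values of
Poisson data nearest to z are from X, and to population Y otherwise, is

$\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} = \dfrac{\mu}{\mu + \nu} \displaystyle\int_{\mathcal{R}} f(z)\, P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } Y)\, dz$
$\qquad + \dfrac{\nu}{\mu + \nu} \displaystyle\int_{\mathcal{R}} g(z)\, P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\, dz$.
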

A proof of the following result is given in Section 4. 

Theorem 1. Assume the Poisson model, that (2.2)-(2.5) hold, and that
$1 \le k_1(\nu) \le k_2(\nu)$, where $k_1(\nu)/\nu^{\epsilon} \to \infty$ and $k_2(\nu) = O(\nu^{1-\epsilon})$ for some
$0 < \epsilon < 1$. Then,

(2.8) $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} - \mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}} = C_1 k^{-1} + C_2 (k/\nu)^{4/d} + o\{k^{-1} + (k/\nu)^{4/d}\}$,

uniformly in $k_1(\nu) \le k \le k_2(\nu)$.

Result (2.8) implies that, provided $\alpha$ is not identically zero, the optimal
k satisfies $k^{\mathrm{Pois}}_{\mathrm{opt}} \sim \mathrm{const.}\,\nu^{4/(d+4)}$. To set (2.8) into context, we note that a
general formula for the difference between the risk of an empirical classifier
and the Bayes risk can be developed from the theory of "plug-in decisions";
see Theorem 2.2, page 16, of Devroye, Györfi and Lugosi (1996), and
Theorem 6.2, page 93, of Györfi et al. (2002). When specialized to the case of
nearest-neighbor methods, this argument bounds the left-hand side of (2.8)
by a constant multiple of $\{k^{-1} + (k/\nu)^{4/d}\}^{1/2}$, the minimum order of which
is $\nu^{-2/(d+4)}$. Mammen and Tsybakov (1999) showed that, in the case where
discrimination boundaries are smooth, substantially faster convergence rates
are possible. Result (2.8) and its analogues in the setting of Theorem 2 give
concise accounts of those faster rates in the case of nearest-neighbor methods.
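
The order of the optimal k quoted after (2.8) can be recovered by minimizing the two leading terms of the expansion directly. The sketch below treats k as a continuous variable, ignores the remainder term and assumes $C_2 > 0$, so it is a heuristic calculation rather than part of the theorem; the symbol $R(k)$ is introduced here only for illustration.

```latex
R(k) = C_1 k^{-1} + C_2 (k/\nu)^{4/d}, \qquad
R'(k) = -C_1 k^{-2} + \frac{4}{d}\, C_2\, \nu^{-4/d}\, k^{(4/d)-1} = 0
\;\Longrightarrow\;
k_{\mathrm{opt}} = \left( \frac{d\, C_1}{4\, C_2} \right)^{d/(d+4)} \nu^{4/(d+4)},
\qquad
R(k_{\mathrm{opt}}) = O\!\left( \nu^{-4/(d+4)} \right).
```

Both terms of $R$ are of the same order at the minimizer, which is why the regret attains the rate $\nu^{-4/(d+4)}$ there.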

Expansion (2.8) has a close analogue in the setting of second-order,
kernel-based methods. See, for example, formulae (3) of Kharin (1982) and (A.2)
of Hall and Kang (2005).

2.3. Binomial model. In the Poisson model we can think of the data
as arriving in a stream $(Z_1, L_1), (Z_2, L_2), \ldots$, where $Z_1, Z_2, \ldots$ comprise a
Poisson process in $\mathbb{R}^d$, with intensity function $\mu f + \nu g$, and the "labels" $L_i$
form a sequence of zeros and ones, independent of one another conditional
on the $Z_i$'s, with $P(L_i = 0 \mid Z_i) = \psi(Z_i)$ and $\psi$ defined by (2.1). If $L_i = 0$,
then $Z_i$ is labeled as coming from the X-population, whereas if $L_i = 1$, then
$Z_i$ is labeled as Y. Since the integral of the Poisson-process intensity over
$\mathbb{R}^d$ equals $\mu + \nu$, the number of points $Z_i$ equals a Poisson-distributed
random variable, T, say, with mean $\mu + \nu$. In the Binomial model we use
the same process to generate data, but now we condition on T.

It is convenient to think of T as m + n, where $m = \mu T/(\mu + \nu)$ and
$n = \nu T/(\mu + \nu)$ are the respective average numbers of points that would
occur in the two training samples if we were to adopt the procedure indicated
above. (In particular, m and n are not necessarily integers.) In this notation
the risk for the nearest-neighbor classifier under the Binomial model can be
written as

$\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}} = \dfrac{m}{m + n} \displaystyle\int_{\mathcal{R}} f(z)\, P^{\mathrm{Bin}}(z \text{ classified by } k\text{-nn rule as type } Y)\, dz$
$\qquad + \dfrac{n}{m + n} \displaystyle\int_{\mathcal{R}} g(z)\, P^{\mathrm{Bin}}(z \text{ classified by } k\text{-nn rule as type } X)\, dz$,

where we use the superscript Bin to indicate that we are sampling under
the Binomial model. If we suppose that

(2.9) $\mu + \nu = T$, a nonrandom integer,

then these manipulations are unnecessary, and so we shall assume (2.9)
below. This condition also implies that the Bayes risk under the Binomial
model, $\mathrm{risk}^{\mathrm{Bin}}_{\mathrm{Bayes}}$, is identical to its counterpart under the Poisson model, and
that helps to further simplify comparisons.

Theorem 2. Assume the Binomial model, that (2.2)-(2.5) and (2.9)
hold, and that $k_1$ and $k_2$ satisfy the conditions imposed on them in
Theorem 1. Then,

(2.10) $\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}} - \mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}} = o\{k^{-1} + (k/\nu)^{4/d}\}$,

uniformly in $k_1(\nu) \le k \le k_2(\nu)$.

A proof of Theorem 2 is given in a longer version of this paper [Hall, Park 
and Samworth (2007)]. 

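Formula (2.10) asserts that the difference between $\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ is
of smaller order than the difference between $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}}$ [see (2.8)
for the latter difference], and hence, implies that the expansion of regret at
(2.8) is equally valid if $\mathrm{risk}^{\mathrm{Pois}}_{k\text{-nn}}$ and $\mathrm{risk}^{\mathrm{Pois}}_{\mathrm{Bayes}}$ there are replaced by $\mathrm{risk}^{\mathrm{Bin}}_{k\text{-nn}}$
and $\mathrm{risk}^{\mathrm{Bin}}_{\mathrm{Bayes}}$, respectively.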

2.4. Empirical choice of $k_{\mathrm{opt}}$. The theoretical results described earlier
can be used to motivate practical methods for choosing k. We shall treat
the Poisson model; the Binomial model can be addressed similarly.

Let M and N be the respective sizes of the training samples $\mathcal{X}$ and
$\mathcal{Y}$. Generate $M^*$ and $N^*$, respectively, from the Poisson distributions with
means equal to M and N. Let $0 < r < 1$. Draw bootstrap resamples $\mathcal{X}^*$
and $\mathcal{Y}^*$, of respective sizes

$M_1^* = [rM^*]$, $N_1^* = [rN^*]$,

from $\mathcal{X}$ and $\mathcal{Y}$. Here, [x] denotes the integer part of x. This choice of $M_1^*$
and $N_1^*$ implies that the total resample size equals $r(M^* + N^*)$, except for
rounding errors arising from taking integer parts. Note too that $M_1^*/(M_1^* + N_1^*)$
equals the sampling fraction $M^*/(M^* + N^*)$ (again modulo integer-part
rounding). This is necessary if our bootstrap algorithm, based on repeated
resamples of sizes $M_1^*$ and $N_1^*$, is to mimic properties of the original sampling
algorithm.

Draw additional resamples $\mathcal{X}^*_{\mathrm{test}}$ and $\mathcal{Y}^*_{\mathrm{test}}$, of respective sizes $M^* - M_1^*$
and $N^* - N_1^*$, from $\mathcal{X}$ and $\mathcal{Y}$. Build nearest-neighbor classifiers based on $\mathcal{X}^*$
and $\mathcal{Y}^*$. Use them to classify the data $\mathcal{X}^*_{\mathrm{test}}$ and $\mathcal{Y}^*_{\mathrm{test}}$, and compute the
resulting error rate. Average this rate over a large number of choices of
$\{\mathcal{X}^*, \mathcal{X}^*_{\mathrm{test}}\}$ and $\{\mathcal{Y}^*, \mathcal{Y}^*_{\mathrm{test}}\}$. Choose $k = \tilde k_{\mathrm{opt}}$ to minimize the average error
rate; it is an estimator of the value of $k_{\mathrm{opt}}(r\mu, r\nu)$ that we would use if
the true intensity function were $r(\mu f + \nu g)$, rather than $\mu f + \nu g$. Convert
$\tilde k_{\mathrm{opt}}$ to an empirical value, $\hat k_{\mathrm{opt}} = r^{-4/(d+4)}\, \tilde k_{\mathrm{opt}}$, that is of the right size for
classification starting from the samples $\mathcal{X}$ and $\mathcal{Y}$.

In the case of the binomial sample-size model, one may follow the same
bootstrapping procedure as in the Poisson case, but generating $M^*$ from
Binomial$(M + N, M/(M + N))$ and taking $N^* = M + N - M^*$.
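
The bootstrap recipe of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: resamples are drawn with replacement (the text does not say so explicitly, but the resample sizes can exceed M and N), distances are Euclidean, ties at exactly k/2 go to X, and the names `knn_error_rates`, `bootstrap_choose_k`, the candidate grid `ks` and the default arguments are all hypothetical.

```python
import numpy as np

def knn_error_rates(x_tr, y_tr, x_te, y_te, ks):
    """Misclassification rate on the test points for every k in ks."""
    train = np.vstack([x_tr, y_tr])
    from_x = np.concatenate([np.ones(len(x_tr)), np.zeros(len(y_tr))])
    test = np.vstack([x_te, y_te])
    truth_x = np.concatenate([np.ones(len(x_te), bool), np.zeros(len(y_te), bool)])
    # Pairwise Euclidean distances: test points in rows, training points in columns.
    dist = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    order = np.argsort(dist, axis=1, kind="stable")
    # cum[i, j] = number of X points among the j + 1 nearest neighbors of test point i.
    cum = np.cumsum(from_x[order], axis=1)
    return np.array([np.mean((cum[:, k - 1] >= k / 2) != truth_x) for k in ks])

def bootstrap_choose_k(x, y, r=0.5, n_boot=100, ks=None, model="poisson", seed=0):
    """Return (k_tilde, k_hat): the bootstrap minimizer and its rescaled version."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    M, N, d = len(x), len(y), x.shape[1]
    if ks is None:  # illustrative default grid of candidate k values
        ks = np.arange(1, max(2, int(r * (M + N) / 2)))
    ks = np.asarray(ks)
    total, used = np.zeros(len(ks)), 0
    for _ in range(n_boot):
        if model == "poisson":
            m_star, n_star = rng.poisson(M), rng.poisson(N)
        else:  # binomial sample-size model
            m_star = rng.binomial(M + N, M / (M + N))
            n_star = M + N - m_star
        m1, n1 = int(r * m_star), int(r * n_star)
        # Skip degenerate draws with too few training points or no test points.
        if m1 + n1 < ks.max() or (m_star - m1) + (n_star - n1) == 0:
            continue
        x_tr, y_tr = x[rng.integers(0, M, m1)], y[rng.integers(0, N, n1)]
        x_te = x[rng.integers(0, M, m_star - m1)]
        y_te = y[rng.integers(0, N, n_star - n1)]
        total += knn_error_rates(x_tr, y_tr, x_te, y_te, ks)
        used += 1
    if used == 0:
        raise ValueError("no usable bootstrap resamples; reduce max(ks) or increase r")
    k_tilde = int(ks[np.argmin(total / used)])
    k_hat = r ** (-4 / (d + 4)) * k_tilde  # rescale from size r(M + N) to M + N
    return k_tilde, k_hat
```

The rescaled value `k_hat` is generally not an integer; in practice one would round it before use. Passing `model="binomial"` switches to the binomial variant described in the last paragraph.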

3. Numerical properties. We present the results of a numerical experiment
demonstrating the effectiveness of the empirical choice $\hat k_{\mathrm{opt}}$ introduced
in Section 2. We simulated 500 training datasets from Poisson sample-size
models for selected pairs of intensity constants $(\mu, \nu)$. Each dataset was
obtained as follows. First, we generated a random number, say, $\mathcal{N}$, from a
Poisson distribution with mean $\mu + \nu$. Then, we drew $\mathcal{N}$ independent data
from the density $\lambda(z) = \{\mu f(z) + \nu g(z)\}/(\mu + \nu)$; let these be $Z_1, \ldots, Z_{\mathcal{N}}$.
For $i = 1, \ldots, \mathcal{N}$, we marked "type X" or "type Y" on $Z_i$ with respective
probabilities $\psi(Z_i) = \mu f(Z_i)/\{\mu f(Z_i) + \nu g(Z_i)\}$ and $1 - \psi(Z_i)$. An equivalent
way of doing this would be to draw $\mathcal{N}$ independent data, each of which
is sampled from the density f or g with respective probabilities $\mu/(\mu + \nu)$
and $\nu/(\mu + \nu)$. Each datum would then be marked "type X" if it was from
f, and "type Y" otherwise.
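
The second ("equivalent") description of the sampling scheme translates directly into code. The sketch below is illustrative only: the function name `draw_training_data` and the user-supplied samplers `sample_f` and `sample_g` are assumptions of ours, and the `model="binomial"` branch simply fixes the total at $\mu + \nu$, as under (2.9).

```python
import numpy as np

def draw_training_data(mu, nu, sample_f, sample_g, model="poisson", rng=None):
    """Draw (x, y) training samples under the Poisson or Binomial sample-size model.

    sample_f(n, rng) and sample_g(n, rng) are user-supplied samplers returning
    n draws from the densities f and g respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    if model == "poisson":
        total = rng.poisson(mu + nu)       # random total number of data
    else:
        total = int(round(mu + nu))        # total fixed in advance
    # Each datum is of type X with probability mu / (mu + nu), independently.
    m = rng.binomial(total, mu / (mu + nu))
    return sample_f(m, rng), sample_g(total - m, rng)

# The d = 1 setting of this section: f = N(-0.5, 1), g = N(0.5, 1).
f = lambda n, rng: rng.normal(-0.5, 1.0, size=(n, 1))
g = lambda n, rng: rng.normal(0.5, 1.0, size=(n, 1))
x, y = draw_training_data(100, 200, f, g, rng=np.random.default_rng(0))
print(len(x), len(y))
```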

We took $(\mu, \nu) = (100, 100)$ and $(100, 200)$ and considered the cases d = 1, 2.
For d = 1, we chose f to be the density function of $N(-0.5, 1)$ and
g to be the density function of $N(0.5, 1)$. For d = 2, we considered two
pairs of densities. One was (f, g), where $f \sim N_2((0.5, -0.5), I_2)$ and
$g \sim N_2((-0.5, 0.5), I_2)$. Here, $I_d$ is the $d \times d$ identity matrix. The other was a
pair of bivariate normal densities, as in the first case but with correlation
$\rho = 0.5$.

For each z, we evaluated

$\hat P^{\mathrm{Pois}}(z \text{ classified as type } X) = \frac{1}{500}\,(\#\ \text{training samples that classify } z \text{ as type } X)$.

The error rate was then estimated by the formula

$\mathrm{Err} = \dfrac{\mu}{\mu + \nu} \displaystyle\int_{\mathcal{R}} f(z)\{1 - \hat P^{\mathrm{Pois}}(z \text{ classified as type } X)\}\, dz + \dfrac{\nu}{\mu + \nu} \displaystyle\int_{\mathcal{R}} g(z)\, \hat P^{\mathrm{Pois}}(z \text{ classified as type } X)\, dz$.

We took $\mathcal{R} = [-2.5, 2.5]^d$, which covered most of the sampling region. To see
the effect of the bootstrap resampling fraction on the performance of $\hat k_{\mathrm{opt}}$,
the three choices r = 1/3, 1/2, 2/3 were considered, where r was defined in
Section 2.4. For computation of $\hat k_{\mathrm{opt}}$, 100 bootstrap resamples were drawn.
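
For the d = 1 setting the Bayes risk on $[-2.5, 2.5]$ can be evaluated by simple numerical integration of $\min\{\mu f, \nu g\}/(\mu + \nu)$. The sketch below uses a midpoint rule; the function names and the grid size are our own choices. The two values should come out close to the "Bayes" entries reported for d = 1 in Table 1 (0.3072 and 0.2685), up to rounding in the last digit.

```python
import math

def normal_pdf(z, mean, sd=1.0):
    """Density of N(mean, sd^2) at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_risk_1d(mu, nu, lo=-2.5, hi=2.5, n_grid=20_000):
    """Midpoint-rule value of the integral of min{mu f, nu g} / (mu + nu) over [lo, hi],
    with f = N(-0.5, 1) and g = N(0.5, 1) as in the d = 1 experiment."""
    h = (hi - lo) / n_grid
    total = 0.0
    for i in range(n_grid):
        z = lo + (i + 0.5) * h
        total += min(mu * normal_pdf(z, -0.5), nu * normal_pdf(z, 0.5))
    return total * h / (mu + nu)

print(bayes_risk_1d(100, 100))
print(bayes_risk_1d(100, 200))
```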

Table 1 shows the estimated error rates of the k-nearest neighbor classifier
with $k_{\mathrm{opt}}$, and the k-nearest neighbor classifier with $\hat k_{\mathrm{opt}}$, for each simulation
setting. Here, $k_{\mathrm{opt}}$ denotes the value of the deterministic k that minimized
the estimated error rate of the k-nn classifier. The Monte Carlo sampling
variability of the estimated error rates can be measured by

$\mathrm{s.e.}(\mathrm{Err}) = \sqrt{\mathrm{Err}(1 - \mathrm{Err})/500}$.

It is seen that the empirical choice $\hat k_{\mathrm{opt}}$ works particularly well. The error
rates of the k-nn classifier with $\hat k_{\mathrm{opt}}$ are not far from the error rate of the
corresponding classifier with $k_{\mathrm{opt}}$. The interval Err ± s.e.(Err), where Err
is the estimated error rate of the k-nn classifier with $\hat k_{\mathrm{opt}}$, contains the
optimal error rate achieved by the corresponding classifier with $k_{\mathrm{opt}}$, except
in the correlated case with $(\mu, \nu) = (100, 200)$. For the latter case, confidence
intervals with two standard errors include the corresponding optimal value.
Overall, the subsampling fraction r = 1/3 gave the best results. However,
the error rate does not change much for different choices of r; the differences
are not statistically significant. This suggests that $\hat k_{\mathrm{opt}}$ may not be sensitive
to the choice of the resampling fraction. In the simulations we tried other
populations with different mean vectors and covariance matrices. Also, we
tried other training sample sizes. The lessons that we learned from the other
simulation settings were basically the same as those obtained from Table 1.
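
The interval statement for the correlated case with (100, 200) can be checked by hand from the reported figures. The sketch below does so for the r = 1/3 column only, using the entries 0.1684 (empirical choice) and 0.1514 (optimal k) from the bottom row of Table 1; the helper name `monte_carlo_se` is ours.

```python
import math

def monte_carlo_se(err, n_datasets=500):
    """Standard error sqrt(Err (1 - Err) / n) of an estimated error rate."""
    return math.sqrt(err * (1 - err) / n_datasets)

err_hat, err_opt = 0.1684, 0.1514  # bottom row of Table 1: r = 1/3 column and k_opt column
se = monte_carlo_se(err_hat)
inside_one_se = err_hat - se <= err_opt <= err_hat + se
inside_two_se = err_hat - 2 * se <= err_opt <= err_hat + 2 * se
print(round(se, 4), inside_one_se, inside_two_se)
```

For these inputs the optimal rate falls just outside the one-standard-error interval and inside the two-standard-error interval.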

Table 1 also suggests that the optimal choice $k_{\mathrm{opt}}$ for the case $\mu \ne \nu$ tends
to be smaller than the one for $\mu = \nu$. Our theory for the rate of $k_{\mathrm{opt}}$ also
was evident empirically. For example, we found that $k_{\mathrm{opt}}$ changed from 27
to 71 when $(\mu, \nu)$ increased from (100, 200) to (400, 800) in the case
corresponding to the bottom row of Table 1. The rate of increase in this case
was 71/27 = 2.63, which was roughly consistent with the theoretical value
$4^{4/(2+4)} = 2.52$. To obtain similar empirical evidence in higher-dimensional
feature spaces, we considered a case with d = 16. We simulated 500 training
datasets from Poisson sample-size models, with f and g being the densities
of $N_{16}((0.25, \ldots, 0.25), I_{16})$ and $N_{16}((-0.25, \ldots, -0.25), I_{16})$, respectively,
when $(\mu, \nu) = (100, 200)$ and $(10000, 20000)$. The relative increase of
$k_{\mathrm{opt}}$ in this case was 61/25 = 2.44, which is not far from its theoretical value
$100^{4/(16+4)} = 2.51$.

Table 1
Error rates of classifiers based on 500 training datasets from Poisson sample-size models
with intensity $\lambda = \mu f + \nu g$, where f and g are densities of normal distributions as
specified in the text. Here, r denotes the subsampling fraction that appears in Section 2.4

                                            k-nn with   k-nn with $\hat k_{\mathrm{opt}}$
d   $(\mu, \nu)$    $\rho$    Bayes    $k_{\mathrm{opt}}$    $k_{\mathrm{opt}}$       r = 1/3   r = 1/2   r = 2/3
1   (100,100)          0.3072   103      0.3119      0.3119    0.3118    0.3120
    (100,200)          0.2685    61      0.2735      0.2759    0.2784    0.2814
2   (100,100)   0      0.2371    71      0.2444      0.2445    0.2450    0.2454
                0.5    0.1566    39      0.1654      0.1682    0.1708    0.1731
    (100,200)   0      0.2125    45      0.2199      0.2236    0.2274    0.2310
                0.5    0.1430    27      0.1514      0.1684    0.1784    0.1870
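
The theoretical growth factors quoted in this comparison follow from $k_{\mathrm{opt}} \propto \nu^{4/(d+4)}$: multiplying $\nu$ by a factor s should multiply $k_{\mathrm{opt}}$ by $s^{4/(d+4)}$. A short check, with a helper name of our own choosing, sets them beside the empirical ratios (27 to 71 for d = 2, and 25 to 61 for d = 16):

```python
def growth_factor(scale, d):
    """Factor by which k_opt ~ nu^(4/(d+4)) grows when nu is multiplied by `scale`."""
    return scale ** (4 / (d + 4))

print(round(growth_factor(4, 2), 2), round(71 / 27, 2))     # d = 2, nu scaled by 4
print(round(growth_factor(100, 16), 2), round(61 / 25, 2))  # d = 16, nu scaled by 100
```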

4. Proof of Theorem 1. Let $\mathcal{S}^\epsilon$ denote the set of points in $\mathcal{R}$ that are
distant no further than $\epsilon > 0$ from $\mathcal{S}$. Write $\mathcal{R} \setminus \mathcal{S}^\epsilon$ for the set of points in
$\mathcal{R}$ that are not in $\mathcal{S}^\epsilon$. Using Markov's inequality, it can be shown that, for
each fixed $C, \epsilon > 0$, we have, as $\nu \to \infty$,

(4.1) $P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X) = I\{\psi(z) > \frac{1}{2}\} + O(\nu^{-C})$,

uniformly in $z \in \mathcal{R} \setminus \mathcal{S}^\epsilon$. By letting $\epsilon = \epsilon(\nu)$ converge to zero sufficiently
slowly in (4.1), we ensure that that result remains true for decreasingly
small $\epsilon$. We need (4.1) only when C = 1, and for $\epsilon(\nu)$ decreasing sufficiently
slowly to zero. This version of (4.1) implies that

(4.2) $\displaystyle\int_{\mathcal{R} \setminus \mathcal{S}^\epsilon} g(z)\, P^{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\, dz = \int_{\mathcal{R} \setminus \mathcal{S}^\epsilon} g(z)\, I\{\psi(z) > \tfrac{1}{2}\}\, dz + O(\nu^{-1})$.

In view of (4.1) and (4.2), properties of f and g away from $\mathcal{S}^\epsilon$ do not
affect the size of regret up to any polynomial order. Hence, there is no loss of
generality in working with distributions for which f and g have two
continuous derivatives on $\mathcal{R}^\epsilon$, rather than simply on $\mathcal{S}^\epsilon$. This simplifies notation,
and so we shall make the assumption below without further comment.

Given $z \in \mathcal{R}$, let $Z_{(1)}, Z_{(2)}, \ldots$ denote the point locations of the marked
point process $\mathcal{P}$, ordered such that $\|z - Z_{(1)}\| \le \|z - Z_{(2)}\| \le \cdots$; let $z_{(1)},
z_{(2)}, \ldots$ represent particular values of $Z_{(1)}, Z_{(2)}, \ldots$, respectively; and put
$\mathbf{z} = (z_{(1)}, \ldots, z_{(k)})$ and $\mathbf{Z} = (Z_{(1)}, \ldots, Z_{(k)})$. Denote by $\Pi^{\mathrm{Pois}}(\mathbf{z}, k)$ the probability,
conditional on $\mathbf{Z} = \mathbf{z}$, that among the points $z_{(1)}, \ldots, z_{(k)}$ there are at least
$\frac{1}{2}k$ points with marks X. We may write

$\Pi^{\mathrm{Pois}}(\mathbf{z}, k) = P\left( \sum_{i=1}^{k} J_i \ge \tfrac{1}{2}k \right)$,

where $J_1, \ldots, J_k$ are independent zero-one variables with

(4.3) $P(J_i = 1) = q_i = \psi(z_{(i)})$.

To aid interpretation of (4.3), note that, since we are here conditioning on
$\mathbf{Z} = \mathbf{z}$, $P(J_i = 1) = P(J_i = 1 \mid \mathbf{Z} = \mathbf{z})$.

Note that, uniformly in $1 \le i \le k \in [k_1(\nu), k_2(\nu)]$,

(4.4) $E\psi(Z_{(i)}) - \psi(z) = \sum_{j=1}^{d} E\{(Z_{(i)} - z)^{(j)}\}\, \psi_j(z)$
$\qquad + \frac{1}{2} \sum_{j_1=1}^{d} \sum_{j_2=1}^{d} E\{(Z_{(i)} - z)^{(j_1)} (Z_{(i)} - z)^{(j_2)}\}\, \psi_{j_1 j_2}(z) + o(E\|Z_{(i)} - z\|^2)$,

where $(Z_{(i)} - z)^{(j)}$ denotes the jth component of $Z_{(i)} - z$, $\psi_j(z) = (\partial/\partial z^{(j)})\psi(z)$
and $\psi_{j_1 j_2}(z) = (\partial^2/\partial z^{(j_1)} \partial z^{(j_2)})\psi(z)$. To obtain (4.4), we have used (2.4),
which implies that, for sufficiently small $\epsilon > 0$, f and g have two continuous
derivatives on $\mathcal{R}^\epsilon$, the latter denoting the set of all points in $\mathbb{R}^d$ that are
distant no further than $\epsilon$ from some point in $\mathcal{R}$. It follows from this result
that, under (2.4), the probability that $Z_{(i)} = Z_{(i)}(z) \in \mathcal{R}^\epsilon$, for all $1 \le i \le k_2(\nu)$
and all $z \in \mathcal{R}$, equals $1 - O(\nu^{-C})$ for all C > 0. This implies the Taylor
expansion of $\psi(Z_{(i)})$ that leads to (4.4), and, in combination with the moment
condition in (2.4), ensures the correctness of the remainder term in (4.4).
Under the conditions of Theorem 1, $E\|Z_{(i)} - z\|^2 = O\{(k/\nu)^{2/d}\}$, and so
(4.4) implies that

(4.5) $\sum_{i=1}^{k} \{E\psi(Z_{(i)}) - \psi(z)\} = \sum_{i=1}^{k} \sum_{j=1}^{d} E(Z_{(i)} - z)^{(j)}\, \psi_j(z)$
$\qquad + \frac{1}{2} \sum_{i=1}^{k} \sum_{j_1=1}^{d} \sum_{j_2=1}^{d} E\{(Z_{(i)} - z)^{(j_1)} (Z_{(i)} - z)^{(j_2)}\}\, \psi_{j_1 j_2}(z) + o\{k(k/\nu)^{2/d}\}$.

Since $\mathcal{R}$ is compact and the remainder in (4.5) is of the stated order for each
$z \in \mathcal{R}$, then the remainder is of that order uniformly in z.

Writing $\Gamma = (\mu/\nu)f + g$ and $\kappa(u, z) = \int_{v : \|v\| \le \|u\|} \Gamma(z + v)\, dv$, we see that
the density of $Z_{(i)} - z$ at u is

$f_i(u, z) = \nu\Gamma(z + u)\, \dfrac{\{\nu\kappa(u, z)\}^{i-1}}{(i - 1)!}\, e^{-\nu\kappa(u, z)}$.

Therefore,

(4.6) $\sum_{i=1}^{k} E(Z_{(i)} - z) = \nu \displaystyle\int u\, \Gamma(z + u)\, P\{W(u, z) \le k - 1\}\, du$,

(4.7) $\sum_{i=1}^{k} E\{(Z_{(i)} - z)(Z_{(i)} - z)^T\} = \nu \displaystyle\int u u^T\, \Gamma(z + u)\, P\{W(u, z) \le k - 1\}\, du$,

where the random variable W(u, z) is Poisson-distributed with mean $\nu\kappa(u, z)$,
and the integrals are over $\mathbb{R}^d$.
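
The passage from the density of $Z_{(i)} - z$ to (4.6) and (4.7) rests on summing that density over i. Written out, with $\Gamma$, $\kappa$ and W as in the surrounding text, the step is the following identity (a restatement for the reader's convenience, not an additional result):

```latex
\sum_{i=1}^{k} f_i(u, z)
  = \nu\,\Gamma(z + u) \sum_{i=1}^{k} \frac{\{\nu\kappa(u, z)\}^{i-1}}{(i-1)!}\, e^{-\nu\kappa(u, z)}
  = \nu\,\Gamma(z + u)\, P\{W(u, z) \le k - 1\},
\qquad W(u, z) \sim \mathrm{Poisson}\{\nu\kappa(u, z)\}.
```

Integrating $u$ and $u u^T$ against this sum gives the right-hand sides of (4.6) and (4.7).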

In (4.6) and (4.7) we shall make the change of variable

(4.8) $u = \left\{ \dfrac{k}{\nu a_d \Gamma(z)} \right\}^{1/d} v$.

If Si > is chosen so small that z^~^^^{i//A;2(;/)}^/'^ — > oo, then, with v defined 
by (4.8), and tj = u~'^^'^{i' /k2{i')}'^^'^ , we have, for all sufficiently large and 
for all \\v\\ >ti and all k^ [ki{u),k2{y)]. 

It follows that, for all sufficiently large and for all ||f || > ti, all A: G 
[ki{v),k2{i'y\ and each C > 0, 

(4.9) <^„,s.^CE\Wiu,z)-EWiu,z^^c. 



{EW{u,z)Yl^ 

where $C_1 > 0$ depends only on C. Here we have used the fact that W(u, z)
is Poisson-distributed with a mean that is bounded below by 1 for $\|v\| \ge t_1$
and large $\nu$.

Combining (4.5), (4.6), (4.7) and (4.9), and noting that the distribution
of W(u, z) is symmetric in u, we deduce that

(4.10) $\sum_{i=1}^{k} \{E\psi(Z_{(i)}) - \psi(z)\} = \nu \displaystyle\int_{u : \|v\| \le t_1} \dot\psi(z)^T u\, \{\Gamma(z + u) - \Gamma(z)\}\, P\{W(u, z) \le k - 1\}\, du$
$\qquad + \dfrac{\nu}{2} \displaystyle\int_{u : \|v\| \le t_1} u^T \ddot\psi(z) u\, \Gamma(z + u)\, P\{W(u, z) \le k - 1\}\, du + o\{k(k/\nu)^{2/d}\}$,

uniformly in $z \in \mathcal{R}$.

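Writing $\dot\Gamma = (\Gamma_1, \ldots, \Gamma_d)^T$, defining $\dot\psi$ and $\dot\rho$ analogously, defining $\ddot\psi = (\psi_{ij})$,
a $d \times d$ matrix, and taking $\mathcal{T}$ to be the set of v such that $\|v\| \le t_1$, and $\mathcal{T}'$
to be the corresponding set of u, given by (4.8), we deduce from (4.10) that

(4.11) $\sum_{i=1}^{k} \{E\psi(Z_{(i)}) - \psi(z)\}$
$\quad = \nu \displaystyle\int_{\mathcal{T}'} \{\dot\psi(z)^T u u^T \dot\Gamma(z) + \tfrac{1}{2} u^T \ddot\psi(z) u\, \Gamma(z)\}\, P\{W(u, z) \le k - 1\}\, du + o\{k(k/\nu)^{2/d}\}$
$\quad = \nu \displaystyle\int_{\mathcal{T}'} \Big[ \sum_{j=1}^{d} (u^{(j)})^2 \{\psi_j(z)\Gamma_j(z) + \tfrac{1}{2}\psi_{jj}(z)\Gamma(z)\} \Big]\, P\{W(u, z) \le k - 1\}\, du + o\{k(k/\nu)^{2/d}\}$
$\quad = \{k/a_d\Gamma(z)\}\{k/\nu a_d\Gamma(z)\}^{2/d} \displaystyle\int_{\mathcal{T}} \Big[ \sum_{j=1}^{d} (v^{(j)})^2 \{\psi_j(z)\Gamma_j(z) + \tfrac{1}{2}\psi_{jj}(z)\Gamma(z)\} \Big]\, P\{W(u, z) \le k - 1\}\, dv + o\{k(k/\nu)^{2/d}\}$,

uniformly in $z \in \mathcal{R}$.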

To control the value of $P\{W(u, z) \le k - 1\}$ in (4.11), we shall use a normal
approximation to the distribution of a Poisson random variable with large
mean, and a crude bound to that distribution when the mean is small.
Specifically, let $Z_\zeta$ have a Poisson distribution with mean $\zeta$. Then, for each
C > 0, there exists a constant $C_1 = C_1(C) > 0$ such that, whenever $\zeta > 0$,

(4.12) (a) for $\zeta \ge 1$, $\sup_{-\infty < x < \infty} (1 + |x|)^C\, |P(Z_\zeta \le \zeta + \zeta^{1/2}x) - \Phi(x)| \le C_1\zeta^{-1/2}$,
and (b) for $0 < \zeta < 1$, $\sup_{x > 0} (1 + |x|)^C\, P(Z_\zeta > x) \le C_1\zeta$.




Since k{u, z) = adT{z)\\u\\'^ {1 + 0{\\u\\'^)} as ||u|| 0, uniformly in z € 7^, 
then if u is given by (4.8), vK{u,z)=k\\v\\'^[l + 0{{k/vf/'^\\v\\^]], uniformly 
in z € 7^. It follows that 

k-l-EW{u,z) _ ky\l-\\vr) o .2/.,, ,,2.. 

+ 0{A:l/2(^/^)2/d||^||2+{d/2)^^-l/2||^||-d/2|^ 

Noting that 

$$
P\{W(u,z) \le k-1\} = P\biggl[\frac{W(u,z) - EW(u,z)}{\{\mathrm{var}\, W(u,z)\}^{1/2}} \le \frac{k - 1 - EW(u,z)}{\{EW(u,z)\}^{1/2}}\biggr],
\tag{4.13}
$$

using (4.12)(a) to produce an approximation to the right-hand side of (4.13) 
when $k^{-1/d} \le \|v\| \le t_1$, and using (4.12)(b) for the same purpose when $\|v\| < k^{-1/d}$, 
we deduce from (4.11) that 

$$
\sum_{i=1}^{k} \{E\psi(Z_{(i)}) - \psi(z)\} = k (k/\nu)^{2/d} \alpha_1(z) + o\{k (k/\nu)^{2/d}\},
\tag{4.14}
$$

uniformly in $z \in \mathcal{R}$, where 

$$
\alpha_1(z) = \{a_d \tau(z)\}^{-1-(2/d)} \int_{\|v\| \le 1} \sum_{j=1}^{d} (v^{(j)})^2 \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\}\, dv
\tag{4.15}
$$



$$
= \{a_d \tau(z)\}^{-1-(2/d)}\, d^{-1} \biggl[\sum_{j=1}^{d} \{\psi_j(z)\tau_j(z) + \tfrac12 \psi_{jj}(z)\tau(z)\}\biggr],
$$



the latter being identical to the function defined at (2.6), except that there, $\psi$ and 
$\tau$ are replaced by their respective limits, $\rho$ and $\lambda$. 
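The integral over $\|v\| \le 1$ in (4.15) reflects the fact that, to leading order, $EW(u,z) = k\|v\|^d$, so that $P\{W(u,z) \le k-1\}$ behaves like the indicator of $\|v\| \le 1$ when $k$ is large. A small numerical sketch of this step follows. It assumes that $W$ is exactly Poisson with mean $k\|v\|^d$ (the remainder terms are ignored), and the helper names are ours.

```python
import math

def poisson_cdf(n, mean):
    """P(W <= n) for W ~ Poisson(mean); terms are summed in log space."""
    if n < 0:
        return 0.0
    if mean == 0.0:
        return 1.0
    total = sum(math.exp(j * math.log(mean) - mean - math.lgamma(j + 1))
                for j in range(n + 1))
    return min(total, 1.0)

def prob_counted(radius, k, d):
    """P{W <= k - 1} when E W is replaced by its leading term k * radius**d."""
    return poisson_cdf(k - 1, k * radius ** d)

if __name__ == "__main__":
    radii = (0.5, 0.9, 1.0, 1.1, 1.5)
    for k in (10, 100, 1000):
        row = "  ".join(f"{prob_counted(r, k, 2):.4f}" for r in radii)
        print(f"k={k:5d}  {row}")
```

As $k$ increases, the printed probabilities move towards 1 for radii below 1 and towards 0 for radii above 1, with a value near 1/2 at radius 1.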

In our proofs throughout Section 4.1, it is convenient to work not with $\mathcal{S}$ 
but with the locus $\mathcal{S}_\nu$ of points $z_0$ such that 

$$
\frac{\mu f(z_0)}{\mu f(z_0) + \nu g(z_0)} = \frac12.
\tag{4.16}
$$

[In this notation $\mathcal{S}_\infty = \lim_{\nu \to \infty} \mathcal{S}_\nu$ is the set of $z_0$ such that $\rho(z_0) = \tfrac12$.] We 
shall suppress the subscript on $\mathcal{S}_\nu$, however, instead showing at the end of 
the proof [see the argument below (4.23)] that the transition from $\mathcal{S} = \mathcal{S}_\nu$ 
to $\mathcal{S}_\infty$ is elementary. 

We wish to develop an approximation to 

$$
K(g) = \int g(z)\, P_{\mathrm{Pois}}(z \text{ classified by } k\text{-nn rule as type } X)\, dz
= \int g(z)\, E\{U_{\mathrm{Pois}}(Z, k)\}\, dz.
\tag{4.17}
$$

If we reinterpret $J_1, \ldots, J_k$ as random variables with distributions depending 
on $Z$, independent conditional on $Z$, and satisfying $P(J_i = 1 \mid Z) = \psi(Z_{(i)})$, 
then 

$$
E\{U_{\mathrm{Pois}}(Z, k)\} = P\Bigl(\sum_{i=1}^{k} J_i > \tfrac12 k\Bigr).
\tag{4.18}
$$

Let $\mathcal{T}_{z_0}$ be the infinite line perpendicular to $\mathcal{S}$ at $z_0$, and let $u$ denote a 
point on $\mathcal{T}_{z_0}$. Now, $\mathcal{T}_{z_0}$ has two "halves," one in the direction where $\psi(u)$ 
immediately increases above $\tfrac12$ as $u$ is moved away from $z_0$, and the other where 
$\psi(u)$ immediately decreases below $\tfrac12$. Call these $\mathcal{T}_{z_0+}$ and $\mathcal{T}_{z_0-}$, respectively. 

Note that $\mathcal{T}_{z_0} = \{z_0 + t\dot\psi(z_0) : -\infty < t < \infty\}$ and $\mathcal{T}_{z_0+} = \{z_0 + t\dot\psi(z_0) : 0 < t < \infty\}$. 

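Put $\mu_k(z) = \sum_{i \le k} E(J_i)$, $\sigma_k(z)^2 = \mathrm{var}(\sum_{i \le k} J_i)$, $W_k(z) = \{\sum_{i \le k} J_i - \mu_k(z)\}/\sigma_k(z)$ 
and $\chi(z) = I\{\psi(z) < \tfrac12\}$. Assume $\epsilon \downarrow 0$ and [...] $\to \infty$ as $\nu \to \infty$. 
Then, 

$$
\sigma_k(z)^2 = \tfrac14 k \{1 + o(1)\}, \quad \text{uniformly in } z \in \mathcal{S}^\epsilon,
\tag{4.19}
$$

as $\nu \to \infty$. By (4.17) and (4.18), 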

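$$
\begin{aligned}
K'(g) &\equiv K(g) - \int g(z)\, (1 - \chi)(z)\, dz = \int g(z)\, [P\{W_k(z) > w_k(z)\} - (1 - \chi)(z)]\, dz, \\
K'(f) &\equiv K(f) - \int f(z)\, \chi(z)\, dz = \int f(z)\, [P\{W_k(z) \le w_k(z)\} - \chi(z)]\, dz,
\end{aligned}
$$

where $w_k(z) = \{\tfrac12 k - \mu_k(z)\}/\sigma_k(z)$. Hence, 

$$
K'' = \frac{\mu}{\mu + \nu} K'(f) + \frac{\nu}{\mu + \nu} K'(g)
= \int \frac{\mu f(z) - \nu g(z)}{\mu + \nu}\, [P\{W_k(z) \le w_k(z)\} - \chi(z)]\, dz.
\tag{4.20}
$$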



In view of (4.19), a standard application of the nonuniform version of the 
Berry-Esseen theorem to the sum of independent random variables represented 
by $W_k(z)$ implies that, for each $C > 0$, 

$$
\sup_{z \in \mathcal{S}^\epsilon}\ \sup_{-\infty < w < \infty} (1 + |w|)^C\, |P\{W_k(z) \le w\} - \Phi(w)| = O(k^{-1/2}).
$$
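Both (4.19) and the Berry-Esseen bound above can be checked numerically for a sum of independent Bernoulli variables whose success probabilities are close to 1/2. The sketch below is an illustration only: the probabilities produced by `probs_near_half` are a hypothetical stand-in for $\psi(Z_{(i)})$, the weight $(1+|w|)^C$ is omitted, and the exact distribution of the sum is obtained by direct convolution.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def exact_pmf(probs):
    """Exact pmf of a sum of independent Bernoulli(p_i) variables, by convolution."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)
            nxt[s + 1] += mass * p
        pmf = nxt
    return pmf

def sup_gap(probs):
    """Return (sup_w |P(W_k <= w) - Phi(w)|, standard deviation of the sum)."""
    mean = sum(probs)
    sd = math.sqrt(sum(p * (1.0 - p) for p in probs))
    cdf_before, worst = 0.0, 0.0
    for s, mass in enumerate(exact_pmf(probs)):
        cdf_after = cdf_before + mass
        phi = normal_cdf((s - mean) / sd)
        worst = max(worst, abs(cdf_after - phi), abs(cdf_before - phi))
        cdf_before = cdf_after
    return worst, sd

def probs_near_half(k):
    """Hypothetical success probabilities that deviate from 1/2 by O(k^{-1/2})."""
    return [0.5 + 0.1 * math.sin(i) / math.sqrt(k) for i in range(1, k + 1)]

if __name__ == "__main__":
    for k in (25, 100, 400):
        gap, sd = sup_gap(probs_near_half(k))
        print(f"k={k:4d}  4*var/k={4 * sd ** 2 / k:.4f}  sqrt(k)*gap={math.sqrt(k) * gap:.4f}")
```

The ratio $4\sigma_k^2/k$ should be close to 1, as in (4.19), and $k^{1/2}$ times the gap should remain bounded, as the $O(k^{-1/2})$ rate suggests.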

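Hence, (4.20) entails, for all $C > 0$, 

$$
K'' = \int \frac{\mu f(z) - \nu g(z)}{\mu + \nu}\, [\Phi\{w_k(z)\} - \chi(z)]\, dz
+ O\biggl[k^{-1/2} \int \Bigl|\frac{\mu f(z) - \nu g(z)}{\mu + \nu}\Bigr|\, \{1 + |w_k(z)|\}^{-C}\, dz\biggr].
\tag{4.21}
$$

Using (4.14), (4.15) and (4.19), it can be shown that, if we take $z = z_0 + k^{-1/2} u$, 
with $z_0 \in \mathcal{S}$ and $u$ given by $z_0 + k^{-1/2} u \in \mathcal{T}_{z_0}$, then 

$$
\begin{aligned}
-w_k(z) &= \{1 + o(1)\}\, 2 k^{-1/2} k\, \bigl[\psi(z) - \tfrac12 + (k/\nu)^{2/d} \alpha_1(z) + o\{(k/\nu)^{2/d}\}\bigr] \\
&= \{1 + o(1)\}\, 2 \biggl[\sum_{j=1}^{d} u^{(j)} \psi_j(z_0) + k^{1/2} (k/\nu)^{2/d} \alpha_1(z_0)\biggr] + o\{\|u\| + k^{1/2} (k/\nu)^{2/d}\},
\end{aligned}
$$

uniformly in $z \in \mathcal{S}^\epsilon$. Hence, writing $U_{z_0} = \mathcal{T}_{z_0} - z_0$, $U_{z_0\pm} = \mathcal{T}_{z_0\pm} - z_0$, and 
$c_k = k^{1/2} (k/\nu)^{2/d}$, we obtain from (4.21) 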

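$$
\begin{aligned}
k K'' &= \int_{\mathcal{S}} \int_{U_{z_0}} \{p \dot f(z_0) - (1-p) \dot g(z_0)\}^T u\, \bigl(\Phi[-2\{\dot\psi(z_0)^T u + c_k \alpha_1(z_0)\}] - I(u \in U_{z_0-})\bigr)\, du\, dz_0 + o(1 + c_k^2) \\
&= \int_{\mathcal{S}} \int_{-\infty}^{\infty} \{p \dot f(z_0) - (1-p) \dot g(z_0)\}^T \dot\psi(z_0)\, \|\dot\psi(z_0)\|^{-1}\, t\, \bigl(\Phi[-2\{\|\dot\psi(z_0)\|\, t + c_k \alpha_1(z_0)\}] - I(t < 0)\bigr)\, dt\, dz_0 + o(1 + c_k^2) \\
&= c_1(\mathcal{S}) + c_2(\mathcal{S})\, c_k^2 + o(1 + c_k^2),
\end{aligned}
\tag{4.22}
$$

where, to obtain the second identity, we take $u = t \dot\psi(z_0)/\|\dot\psi(z_0)\|$. In (4.22), 
$c_1(\mathcal{S})$ and $c_2(\mathcal{S})$ have the definitions at (2.7), except that here $\mathcal{S}$ is interpreted 
as the set of points $z_0$ for which (4.16) holds. 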
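The reason the final line of (4.22) is a constant plus a multiple of the square of the factor multiplying the bias term can be seen from the inner integral over $t$. Writing $b$ for the gradient norm and $a$ for the bias term (both symbols are ours, introduced only for this illustration), a direct calculation gives $\int_{-\infty}^{\infty} t\,[\Phi\{-2(bt + a)\} - I(t < 0)]\, dt = (1 + 4a^2)/(8b^2)$, which is quadratic in $a$. The sketch below checks this identity by a midpoint rule; it is a numerical sanity check under these assumptions, not a derivation of the constants at (2.7).

```python
import math

def normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def inner_integral(b, a, half_width=12.0, n=100_000):
    """Midpoint-rule value of the integral of t * [Phi(-2(b t + a)) - 1{t < 0}].

    n is even, so t = 0 (where the integrand has a kink) is a cell boundary.
    """
    h = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        t = -half_width + (i + 0.5) * h
        indicator = 1.0 if t < 0.0 else 0.0
        total += t * (normal_cdf(-2.0 * (b * t + a)) - indicator)
    return total * h

def closed_form(b, a):
    """(1 + 4 a^2) / (8 b^2): half the second moment of a N(-a/b, 1/(4 b^2)) variable."""
    return (1.0 + 4.0 * a * a) / (8.0 * b * b)

if __name__ == "__main__":
    for b, a in ((1.0, 0.0), (0.5, 1.0), (2.0, -0.7)):
        print(f"b={b}, a={a}: numeric={inner_integral(b, a):.8f}  closed form={closed_form(b, a):.8f}")
```

The term free of $a$ corresponds to the variance contribution and the term in $a^2$ to the squared-bias contribution.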
Combining (4.2) and (4.22), we deduce that 

$$
\mathrm{risk}_{\mathrm{Pois}} - \mathrm{risk}_{\mathrm{Bayes}} = c_1(\mathcal{S})\, k^{-1} + c_2(\mathcal{S})\, (k/\nu)^{4/d} + o\{k^{-1} + (k/\nu)^{4/d}\}.
\tag{4.23}
$$

Under the conditions assumed in Theorem 1, $\mu/(\mu + \nu) \to p$ as $\nu \to \infty$, from 
which it follows that $c_1(\mathcal{S})$ and $c_2(\mathcal{S})$ converge to the values they would 
take if we were to define $\mathcal{S}$ as the set of points $z_0$ for which, instead of 
(4.16), $p f(z_0)/\{p f(z_0) + (1-p) g(z_0)\} = \tfrac12$. This is the definition used for $\mathcal{S}$ 
at (2.7). Note too that $\psi \to \rho$ and $\tau \to \lambda$ as $\nu \to \infty$, and that these limits 
arise in a very simple way. For example, $\tau = (\mu/\nu) f + g$ converges to $\lambda = 
p(1-p)^{-1} f + g$ since $\mu/\nu \to p(1-p)^{-1}$; the functions $f$ and $g$ remain fixed. 
Since $c_1(\mathcal{S})$ and $c_2(\mathcal{S})$ converge to their values at (2.7), then Theorem 1 
follows from (4.23). 
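The two leading terms in (4.23) pull in opposite directions as $k$ varies, and, if $c_1$ and $c_2$ are both positive, their sum is minimised at $k = \{d c_1/(4 c_2)\}^{d/(d+4)}\, \nu^{4/(d+4)}$, obtained by setting the derivative in $k$ to zero. The following sketch evaluates this expression; the numerical values of $c_1$, $c_2$, $\nu$ and $d$ are hypothetical and chosen only for illustration, and the remainder term in (4.23) is ignored.

```python
def leading_regret(k, nu, d, c1, c2):
    """Leading terms of (4.23): c1 / k + c2 * (k / nu)**(4 / d)."""
    return c1 / k + c2 * (k / nu) ** (4.0 / d)

def optimal_k(nu, d, c1, c2):
    """Minimiser of leading_regret over real k > 0, assuming c1 > 0 and c2 > 0."""
    return (d * c1 / (4.0 * c2)) ** (d / (d + 4.0)) * nu ** (4.0 / (d + 4.0))

if __name__ == "__main__":
    nu, d, c1, c2 = 10_000, 2, 1.0, 1.0  # hypothetical values
    k_star = optimal_k(nu, d, c1, c2)
    print(f"k* = {k_star:.1f}, leading regret = {leading_regret(k_star, nu, d, c1, c2):.6f}")
    # Compare with a brute-force search over integer k.
    k_grid = min(range(1, nu + 1), key=lambda k: leading_regret(k, nu, d, c1, c2))
    print(f"best integer k on grid = {k_grid}")
```

In practice $c_1$ and $c_2$ are unknown, so this formula indicates only the order of magnitude, $\nu^{4/(d+4)}$, of a suitable $k$.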

Acknowledgments. We are grateful to the reviewers for helpful com- 
ments. 